# Loading libraries:
suppressMessages({
library(ALDEx2)
library(data.table)
library(dplyr)
library(IRdisplay)
library(KEGGREST)
library(MGnifyR)
library(pathview)
library(tidyjson)
})
#display_markdown(file = 'assets/mgnifyr_help.md')
Pathways Visualisation
You can run and edit these examples interactively on Galaxy
MGnifyR is a library that provides a set of tools for easily accessing and processing MGnify data in R, making queries to MGnify databases through the MGnify API. The benefit of MGnifyR is that data can either be fetched in tsv format or be directly combined in a phyloseq object to run an analysis in a custom workflow.
This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter.
Pathways Visualisation
In this notebook we aim to demonstrate how the MGnifyR tool can be used to fetch functional annotation results generated through the MGnify metagenomic analyisis pipelines.
Then we show how to generate the pathways visualization using Pathview in R with the following input data:
- Annotated functional genes MGnify
- Annotated metabolic compounds Metabolights
Contents
- Introduction
- Part 1. Drawing presence/absence of KOs for one metagenomic sample
- Part 2. Comparing groups of samples, drawing KOs abundance
- Part 3. Plotting presence/absence of KOs and metabolites for one sample
- References
# Setting tables and figures size to display (these will be reset later):
options(repr.matrix.max.cols=150, repr.matrix.max.rows=500)
options(repr.plot.width=4, repr.plot.height=4)
# Setting up functions
<- function(ids_list) {
collect_pathways = list()
pathways for (id in ids_list) {
= as.list(keggLink("pathway", id))
current_pathway for (index in grep("map", current_pathway)) {
= gsub("*path:", "", current_pathway[index])
clean_id # Discarding chemical structure (map010XX), global (map011XX), and overview (map012XX) maps
= substring(clean_id, 1, 6)
prefix if(is.na(match("map010", prefix)) & is.na(match("map011", prefix)) & is.na(match("map012", prefix)) ){
= append(pathways, clean_id)
pathways
}
}
}return(pathways)
}
# Create your session mgnify_client object
= mgnify_client(usecache = T, cache_dir = '/home/jovyan/.mgnify_cache') mg
Introduction
The goal of this notebook is to demonstrate how to create KEGG pathway maps to visualise metabolic potential and metabolite production in metagenomic samples. We will use metabolic pathways annotated at the gene level by assigning a KEGG Orthology (KO) to putative protein sequences. These results are generated through the MGnify v5.0 pipeline for metagenomic assemblies, as shown in the workflow schema below. We will also use the completeness estimation of KEGG modules available in the MGnify web portal. Consider that modules completeness is determined by the shortest possible path which contains the essential steps for the pathway; therefore, even if 100% completeness is achieved, there can still be gaps.
The KEGG (Kyoto Encyclopedia of Genes and Genomes) database is a collection of biological databases, including genetic and metabolic pathways, diseases and drugs, protein-protein interactions, and gene expression. We use the KEGG database to make connections between genes and their biological functions, and define pathways containing the functions to be used as a resource for systems biology.
Some key concepts we will use in this notebook are:
KEGG Orthologs (KO): Orthologs are genes in different species that evolved from a common ancestral gene and typically retain similar functions. KEGG Orthologs are a set of manually curated genes and proteins grouped into categories based on their functional similarities, particularly their roles in biological pathways and molecular functions. KEGG Orthologues IDs are unique and always starts with the letter ‘K’ (uppercase) followed by 5 numbers.
KEGG Modules: Modules are clusters of related genes that are involved in a specific biological process or pathway. KEGG modules IDs starts with ‘M’ followed by 5 numbers.
KEGG Pathways: Pathways are sets of interconnected biochemical reactions that form a chain leading from an initial reactant to a final product. KEGG pathways provide a high-level overview of the major metabolic pathways in an organism, while KEGG modules provide a more detailed view of the genes and reactions involved in a specific pathway. The IDs of manually drawn reference pathway starts with the word ‘map’ followed by 5 numbers. Another pathways prefix you will se in the notebook is ‘ko’ for reference pathway highlighting KO.
For a better display of results, in this notebook we are not using as templates chemical structure (map010XX), global (map011XX), and overview (map012XX) maps.
In the following sections of this introduction you will find a couple of simplest minimal examples on the main functions we will use to fetch and format the input tables for KEGG pathways visualization.
Minimal example 1. Fetching KEGG orthologs and modules tables from MGnify
MgnifyR
has pre-built functions to retrieve data from MGnify databases. The mgnify_retrieve_json
function can be used to access results that are not available in tabular format. In this notebook we will fetch the pathways annotation and pathways completeness tables generated by the latest version of the MGnify analysis pipeline in json format and reformat into dataframes to easily manipulate the data.
- Example on how to fetch the KOs counts table for one sample having the analysis accession
MGYA00636312
:
head(ko_data, 3)
- And we can also fetch the modules completeness table for the same sample using the following code:
= mgnify_retrieve_json(mg, path = 'analyses/MGYA00636312/kegg-modules')
ko_comp_json = as.data.frame(ko_comp_json %>% spread_all)[ , c("id", "attributes.completeness")] ko_comp
head(ko_comp, 3)
Minimal example 2. Accessing to KEGG pathways info using the KEGGREST API
KEGGREST is a powerful tool with multiple functions to access and utilize data from KEGG databases. We can use KEGGREST
to map between IDs in different databases and to find pathways associated with a module, orthologue, or compound. To display the list of KEGG databases we can query, you can run listDatabases()
command.
- With
keggLink
function we can find the ID of the pathways to use as template to draw the module of methane production from CO2 (M00567). We find that the module M00567 is present in 4 possible pathways.
as.list(keggLink("pathway", 'M00567'))
- Using
keggFind
to find all the KOs associated to the Methyl-coenzyme M reductase (EC 2.8.4.1)
as.list(keggFind("ko", '2.8.4.1'))
Part 1. Drawing presence/absence of KOs for one metagenomic sample
For Parts 1 and 2 of this notebook, we will use MGnify results generated for two studies: 1. Metagenomes of bacteria colonizing the gut of Apis mellifera and Apis cerana from Japan (MGYS00006180). 2. Gut microbiota of Switzerland honeybees (MGYS00006178).
The original analysis based on viral communities can be found in the Bonilla-Rosso et al., publication.
1.1. Fetching data from MGnify
- Fetching the analysis accession list using the study accessions.
- Keeping just the first analysis accession to fetch the kegg orthologs count table from the MGnify API and transform from JSON to matrix.
= head(all_accessions, 1)
accession = paste0('analyses/',accession,'/kegg-orthologs') ko_loc
= mgnify_retrieve_json(mg, path = ko_loc)
ko_json = as.data.frame(ko_json %>% spread_all)[ , c("attributes.accession", "attributes.count")]
ko_data = data.frame(ko_data, row.names=1)
ko_data colnames(ko_data)[1] = 'counts'
= data.matrix(ko_data) ko_matrix
head(ko_matrix, 3)
1.2. Selecting pathways
- We are going to draw metabolic pathways for essential metabolism of carbohydrates plus fructose and manose metabolism, which is non-essential:
- Glycolysis/Gluconeogenesis (00010)
- Citric Acid (TCA) Cycle (00020)
- Pentose Phosphate Pathway (00030)
- Fructose and manose metabolism (00051)
1.3. Ready to draw!
- As we are plotting absence/presence, we set the number of bins = 2, the scale in one direction, and use 1 as limit. You will see some ‘Note’ and ‘Info’ messages, don’t panic, it is pathview normal behaviour.
- Cleaning the working directory.
if(!dir.exists("output_plots")){
dir.create("output_plots")
dir.create("output_plots/single_sample")
}
file.copy(from=list.files(pattern="./*pathview.png"), to="./output_plots/single_sample/", overwrite = TRUE)
= list.files(path = ".", pattern = "*.png")
png_files = list.files(path = ".", pattern = "*.xml")
xml_files = c(png_files, xml_files)
files unlink(files)
- This is one example of Pathview outputs. The rest of the generated figures are stored at
output_plots/single_sample/
directory. You can explore the files by clicking on them at the left-side panel.
display_png(file='./output_plots/single_sample/ko00030.pathview.png')
Part 2. Comparing groups of samples, drawing KOs abundance
In this part of the notebook we use the whole list of accessions for both studies (MGYS00006180
and MGYS00006178
). After integrating the KO’s tables, we run Aldex2 to find differentially abudant genes with KO annotation between honeybees from Japan and Switzerland, and we use the effect
as scale to plot the pathways with the highest number of differentially abundant KOs. Consider that steps involving fetching KO tables and KEGGREST
queries can take few minutes to run.
2.1. Fetching KO tables from MGnify
- Generating condition labels.
= list()
cond_list for (study_id in accession_alias$'study_attributes.accession') {
if (study_id == 'MGYS00006180') {
= append(cond_list , 'Japan')
cond_list else {
} = append(cond_list , 'Switzerland')
cond_list
}
}$condition = cond_list accession_alias
table(unlist(accession_alias$condition))
- Selecting randomly 15 samples from each location to speed up the downstream analysis.
set.seed(123)
= accession_alias %>% group_by(condition)
grouped_df = grouped_df %>% sample_n(size = 15, replace = FALSE) random_sample
- Download and integrate KO counts tables. We will discard any sample having zero genes with KO annotation. This step takes around 3 minutes to complete.
# Saving KO tables as a list of dataframes
= random_sample$'analysis_accession'
samples_list = list()
list_of_dfs for (accession in samples_list) {
= paste0('analyses/',accession,'/kegg-orthologs')
ko_loc = mgnify_retrieve_json(mg, path = ko_loc)
ko_json = as.data.frame(ko_json %>% spread_all)[ , c("attributes.accession", "attributes.count")]
ko_data colnames(ko_data) = c('ko_id', accession)
= append(list_of_dfs, list(ko_data))
list_of_dfs }
# Integrating all dataframes into a single dataframe
= list_of_dfs[[1]]
integrated_df for (i in 2:length(list_of_dfs)){
= merge(integrated_df,list_of_dfs[[i]], all = T)
integrated_df
}
## Cleaning the integrated table
# Using the KO id column as row names
row.names(integrated_df) = integrated_df$ko_id
$ko_id = NULL
integrated_df
# Converting NA to zero
is.na(integrated_df)] = 0
integrated_df[
# Discarding samples that KOs abundance sum = 0
= integrated_df %>% select_if(is.numeric) %>% select_if(~ sum(. != 0) > 0) integrated_df
head(integrated_df, c(3, 2))
- Reformating the condition label according with the KOs dataframe.
= list()
sorted_conds for (sample in colnames(integrated_df)) {
= accession_alias[accession_alias$analysis_accession %in% sample,]$condition
match = paste(match, collapse = "")
cond = append(sorted_conds, cond)
sorted_conds
}= unlist(sorted_conds) vector_conds
table(vector_conds)
2.2. Generating differentially abundance count tables
- Running Aldex step. It takes 2 minutes to complete.
head(x.all, 3)
- The column
effect
in the above output (x.all
table) contains the log ratio of the sample mean to the reference mean. A positive effect indicates that the sample mean is greater than the reference mean, while a negative effect indicates that the sample mean is lower than the reference mean. In this example, Japan group is used as reference. We will generate now the matrix of KO-effect for plotting.
= data.matrix(subset(x.all, select = c('effect'))) ko_matrix
head(ko_matrix, 3)
- Plotting the effect size (
effect
) and difference (diff.btw
) to show an overview of differentially abundant functions. Two type of P-value results from the Welch’s t test are shown: In blue marks the expected P-value; in red marks the expected Benjamini-Hochberg corrected P-value. The threshold line indicates P-value = 0.05.
options(repr.plot.width=10, repr.plot.height=8)
# Effect size plot
par(mfrow=c(1,2))
plot(x.all$effect,
$we.ep,
x.alllog="y",
cex=0.7,
col=rgb(0,0,1,0.2), # Blue marks for expected P value of Welch’s t test
pch=19,
xlab="Effect size",
ylab="P value",
main="Effect size plot")
points(x.all$effect,
$we.eBH,
x.allcex=0.7,
col=rgb(1,0,0,0.2), # Red marks for expected Benjamini-Hochberg corrected P value of Welch’s t test
pch=19)
abline(h=0.05, lty=2, col="grey")
legend(-0.5,0.0005, legend=c("P value", "BH-adjusted"), pch=19, col=c("blue", "red"))
# Volcano plot
plot(x.all$diff.btw,
$we.ep,
x.alllog="y",
cex=0.7,
col=rgb(0,0,1,0.2), # Blue marks for expected P value of Welch’s t test
pch=19,
xlab="Difference",
ylab="P value",
main="Volcano plot")
points(x.all$diff.btw,
$we.eBH,
x.allcex=0.7,
col=rgb(1,0,0,0.2), # Red marks for expected Benjamini-Hochberg corrected P value of Welch’s t test
pch=19)
abline(h=0.05, lty=2, col="grey")
legend(-2,0.0005, legend=c("P value", "BH-adjusted"), pch=19, col=c("blue", "red"))
- Reporting features detected by the Welchs’ or Wilcoxon test individually (blue) or by both (red).
options(repr.plot.width=8, repr.plot.height=8)
<- which(x.all$we.eBH < 0.05 & x.all$wi.eBH < 0.05)
found.by.all <- which(x.all$we.eBH < 0.05 | x.all$wi.eBH < 0.05)
found.by.one
plot(x.all$diff.win, x.all$diff.btw, pch=19, cex=1, col=rgb(0,0,0,0.3),
xlab="Dispersion", ylab="Difference")
points(x.all$diff.win[found.by.one], x.all$diff.btw[found.by.one], pch=19,
cex=1, col=rgb(0,0,1,0.5))
points(x.all$diff.win[found.by.all], x.all$diff.btw[found.by.all], pch=19,
cex=1, col=rgb(1,0,0,1))
abline(0,1,lty=2)
abline(0,-1,lty=2)
2.3. Selecting pathways to draw
- We will use the union of both testing methods (Welchs’ or Wilcoxon) to find the pathways with differentially abundant KOs.
= collect_pathways(kos_list) ko_pathways
head(ko_pathways)
- Finding the pathway with the highest number of significant KOs.
= list()
pathways_counts for (path_element in ko_pathways) {
if (path_element %in% names(pathways_counts)) {
= pathways_counts[[path_element]] + 1
new_value = new_value
pathways_counts[path_element] else {
} = 1
pathways_counts[path_element]
} }
= names(tail(pathways_counts[order(unlist(pathways_counts))], 1))
top_to_plot top_to_plot
2.4. Drawing pathways!
- For this type of data, values range from -1 to 1, with both negative and positive fractions. We use both directions when plotting.
- Cleaning the working directory.
if(!dir.exists("output_plots/comparative")){
dir.create("output_plots/comparative")
}
file.copy(from=list.files(pattern="./*pathview.png"), to="./output_plots/comparative/", overwrite = TRUE)
= list.files(path = ".", pattern = "*.png")
png_files = list.files(path = ".", pattern = "*.xml")
xml_files = c(png_files, xml_files)
files unlink(files)
- This is the Pathview output stored at
output_plots/comparative/
directory.
display_png(file='./output_plots/comparative/ko00860.pathview.png')
Part 3. Plotting presence/absence of KOs and metabolites for one sample
For this part of the notebook we will use public data of the study An untargeted metabolomics approach to characterize dissolved organic matter in groundwater of the Samail Ophiolite. Metatranscriptomics (RNA-Seq) raw-reads are available at the ENA under the accession PRJNA560313. Analysis results are available on MGnify under the study accession MGYS00006297, and compound data can be found on Metabolights under the accession MTBLS6081.
3.1. Fetching KOs table from MGnify
- As we are using only one sample, we will fetch metadata and the KO table for that sample directly from MGnify.
= paste0('analyses/',sample,'/kegg-orthologs')
ko_loc
= mgnify_retrieve_json(mg, path = ko_loc)
ko_json = as.data.frame(ko_json %>% spread_all)[ , c("attributes.accession", "attributes.count")]
ko_data = data.frame(ko_data, row.names=1)
ko_data colnames(ko_data)[1] = 'counts'
# Transforming kos matrix into a named vector
= ko_data$'counts'
ko_values = rownames(ko_data)
ko_names = setNames(ko_values, ko_names)
ko_vector
head(ko_vector, 3)
- Find the pathways with at least 10 genes. This step takes 3 min to run.
= collect_pathways(ko_names) ko_pathways
= list()
pathways_counts for (path_element in ko_pathways) {
if (path_element %in% names(pathways_counts)) {
= pathways_counts[[path_element]] + 1
new_value = new_value
pathways_counts[path_element] else {
} = 1
pathways_counts[path_element]
}
}= names(pathways_counts)[sapply(pathways_counts, function(x) x > 9)]
top_genes_pathways print(length(top_genes_pathways))
#print(top_genes_pathways)
3.2. Fetching and formatting metabolites data
- We will fetch the samples table s_MTBLS6081.txt and the compounds table m_MTBLS6081_LC-MS_negative_reverse-phase_metabolite_profiling_v2_maf.tsv from the Metabolights database using the url and save them on files with shorter names.
# Downloading the metabolites samples table = 'https://www.ebi.ac.uk/metabolights/ws/studies/MTBLS6081/download/39846581-6c64-4e09-99dd-74cb05921f85?file=s_MTBLS6081.txt' url = 's_MTBLS6081.txt' destfile download.file(url, destfile = destfile, method = "auto") # Downloading the compounds table = 'https://www.ebi.ac.uk/metabolights/ws/studies/MTBLS6081/download/39846581-6c64-4e09-99dd-74cb05921f85?file=m_MTBLS6081_LC-MS_negative_reverse-phase_metabolite_profiling_v2_maf.tsv' url = 'm_MTBLS6081.tsv' destfile download.file(url, destfile = destfile, method = "auto")
- Finding the sample ID corresponding to MGnify results on the metadata table
= read.table("s_MTBLS6081.txt", header = TRUE, sep = "\t")
metadata_comp = metadata_comp[grepl(mgnify_sample, metadata_comp$'Source.Name'), ]$'Sample.Name'
metabolights_sample print(c(mgnify_sample, metabolights_sample))
- Parsing and formating the metabolites table
= read.table("m_MTBLS6081.tsv", header = TRUE, sep = "\t", comment.char = "|", quote = "")
compound_data head(compound_data, 3)
# Keeping rows with CHEBI id
= compound_data[grepl("CHEBI", compound_data$database_identifier), ]
chebi_compound
# Keeping data for the sample of interest only
= subset(chebi_compound, select = c('database_identifier', metabolights_sample))
chebi_compound
# Discarding rows with abundance = 0
= chebi_compound[chebi_compound[, 2] != 0, ]
compound_clean
head(compound_clean, 3)
- Transforming CHEBI ids into KEGG compound IDs. To speed up this step, instead of using KEGGREST, we are going to use a relational table of IDs available at the FTP site of the EBI public databases.
# Downloading the relational table
= 'https://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/database_accession.tsv'
url = 'chebi2kegg.tsv'
destfile download.file(url, destfile = destfile, method = "auto")
# Reading the table
= fread("chebi2kegg.tsv") chebi2kegg_df
# Generate a subtable of ChEBI accessions having KEGG compound accessions
= chebi2kegg_df[grepl('KEGG COMPOUND accession', chebi2kegg_df$TYPE), ]
ch2co
# Extract compound IDs from clean metabolights table
= gsub("CHEBI:", "", compound_clean$'database_identifier')
chebi_ids
# Filter ch2co for relevant rows using chebi_ids
= ch2co[ch2co$COMPOUND_ID %in% chebi_ids, ]
filtered_ch2co
# Extract unique KEGG IDs from filtered_ch2co
= unique(filtered_ch2co$ACCESSION_NUMBER)
unique_kegg_ids
# Saving unique_kegg_ids as a vector
= unique_kegg_ids
cpd_vector length(cpd_vector)
3.3. Drawing!
- We will use the list of pathways having at least 10 genes to draw. Note that not all of them have compounds data. As we are plotting absence/presence, we set the number of bins = 2, the scale in one direction, and use 1 as limit. This step takes 3 min to run and generates a bunch of files on the working directory that we will clean on the next step.
- Cleaning working directory.
if(!dir.exists("output_plots/metabolites")){
dir.create("output_plots/metabolites")
}
file.copy(from=list.files(pattern="./*pathview.png"), to="./output_plots/metabolites/", overwrite = TRUE)
= list.files(path = ".", pattern = "*.png")
png_files = list.files(path = ".", pattern = "*.xml")
xml_files = c(png_files, xml_files)
files unlink(files)
- Here are two examples of Pathview outputs that showcase the intersection between active genes and the presence of compounds. Please note that we have intentionally refrained from filtering out genes or compounds based on abundance or quality for these specific examples. Our primary objective is to generate visualization material that effectively highlights these intersections. The rest of the generated figures are stored at
output_plots/metabolites/
directory. You can explore the files by clicking on them at the left-side panel.
display_png(file="./output_plots/metabolites/ko00400.pathview.png")
display_png(file="./output_plots/metabolites/ko02060.pathview.png")
# Get pathview help
?pathview
### References:
Datasets and databases papers:
Honeybee datasets (used in parts 1 and 2 of this notebook): https://doi.org/10.1073/pnas.2000228117
Metabolites dataset (used in part 3): https://doi.org/10.1101/2022.11.09.515806
MGnify pipeline: https://doi.org/10.1093/nar/gkac1080
KEGG database: https://doi.org/10.1093/nar/gkw1092
Metabolights database: https://doi.org/10.1093/nar/gkz1019
R libraries:
library(ALDEx2)
Gloor GB, Macklaim JM, Fernandes AD (2016). Displaying Variation in Large Datasets: a Visual Summary of Effect Sizes. Journal of Computational and Graphical Statistics, 2016 http://doi.org/10.1080/10618600.2015.1131161. R package version 1.30.0.library(data.table)
Matt Dowle and Arun Srinivasan (2023). data.table: Extension ofdata.frame
. R package version 1.14.8. https://CRAN.R-project.org/package=data.tablelibrary(dplyr)
Hadley Wickham, Romain François, Lionel Henry, Kirill Müller and Davis Vaughan (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.2. https://CRAN.R-project.org/package=dplyrlibrary(IRdisplay)
Thomas Kluyver, Philipp Angerer and Jan Schulz (NA). IRdisplay: ‘Jupyter’ Display Machinery. R package version 1.1. https://github.com/IRkernel/IRdisplaylibrary(KEGGREST)
Dan Tenenbaum and Bioconductor Package Maintainer (2021). KEGGREST: Client-side REST access to the Kyoto Encyclopedia of Genes and Genomes (KEGG). R package version 1.38.0.library(MGnifyR)
Ben Allen (2022). MGnifyR: R interface to EBI MGnify metagenomics resource. R package version 0.1.0.library(pathview)
Luo, W. and Brouwer C., Pathview: an R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics, 2013, 29(14): 1830-1831, doi: 10.1093/bioinformatics/btt285. R package version 1.38.0.library(tidyjson)
Jeremy Stanley and Cole Arendt (2023). tidyjson: Tidy Complex ‘JSON’. R package version 0.3.2. https://CRAN.R-project.org/package=tidyjson
Going deeper:
If you want to learn more about using MGnifyR, you can follow the online tutorial available here: https://www.ebi.ac.uk/training/online/courses/metagenomics-bioinformatics/mgnifyr/
KEGGREST documentation with multiple examples: https://www.bioconductor.org/packages/devel/bioc/vignettes/KEGGREST/inst/doc/KEGGREST-vignette.html
Pathview user manual: https://www.bioconductor.org/packages/release/bioc/vignettes/pathview/inst/doc/pathview.pdf